Online Shoppers Dataset

Data reading and cleaning

Importing the dataset

online_shoppers <- data.frame(read.csv("online_shoppers_intention.csv"))
str(online_shoppers)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : chr  "Feb" "Feb" "Feb" "Feb" ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : chr  "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
dim(online_shoppers)
## [1] 12330    18

Cleaning up the NAs

There are no NAs in the dataset, and no ommissions in the end.

which(is.na(online_shoppers))
## integer(0)
online_shoppers <- na.omit(online_shoppers)
str(online_shoppers)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : chr  "Feb" "Feb" "Feb" "Feb" ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : chr  "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

More clean up

Some of the features are not in their correct data type. We will factorize some of the variables. Some of the factored features have lots of levels, which could be a problem when designing models with dummies and increase computing expense.

online_shoppers$OperatingSystems = factor(online_shoppers$OperatingSystems)
online_shoppers$Browser = factor(online_shoppers$Browser)
online_shoppers$Region = factor(online_shoppers$Region)
online_shoppers$TrafficType = factor(online_shoppers$TrafficType)
online_shoppers$VisitorType = factor(online_shoppers$VisitorType)
online_shoppers$Weekend = factor(online_shoppers$Weekend)
online_shoppers$Revenue = factor(online_shoppers$Revenue)
# online_shoppers$Month = factor(online_shoppers$Month)
str(online_shoppers)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : chr  "Feb" "Feb" "Feb" "Feb" ...
##  $ OperatingSystems       : Factor w/ 8 levels "1","2","3","4",..: 1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : Factor w/ 13 levels "1","2","3","4",..: 1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : Factor w/ 9 levels "1","2","3","4",..: 1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 2 1 1 2 1 1 ...
##  $ Revenue                : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...

EDA Pre-logistic Regression

Let’s do an initial mixed corrplot to see the relationship between the features and the possibility of generating revenue. We will only look at the numerical/ordinal factors without the factors.

library(corrplot)
online_shoppers_num = subset(online_shoppers, select = -c(Month, OperatingSystems, Browser, Region, TrafficType, VisitorType))
online_shoppers_num$Weekend = as.numeric(online_shoppers_num$Weekend)
online_shoppers_num$Revenue = as.numeric(online_shoppers_num$Revenue)
str(online_shoppers_num)
## 'data.frame':    12330 obs. of  12 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Weekend                : num  1 1 1 1 2 1 1 2 1 1 ...
##  $ Revenue                : num  1 1 1 1 1 1 1 1 1 1 ...
shop_corr <- cor(online_shoppers_num)
colnames(shop_corr) <- c("Adm", "A_D", "Info", "I_D", "PR", "PR_D", "BR", "ER", "PV", "SDay", "WE", "Rev")
corrplot.mixed(shop_corr)

We observed some very obvious correlation between things like the number of administrative pages visited and the amount of time each visitor spent on administrative pages, which is expected just like informational/informational_duration and productedRelated/productRelated_Duration.

The highest correlation with our response is the PageValues, which had a .49 correlation with Revenue. Again, this higher correlation makes sense because the PageValues feature is an average value/metric of the page visited before the user completed a transaction. This is expected, as pagevalues We will do some more EDA on that.

Other than that, the Bounce and Exit Rates both have a negative correlation with the likelihood of generating revenue, while users that visited productRelated pages seems to more likely generate revenue over users that visited admin pages.

From the t-test, we reject the null, and bring strong evidence to support the alternative hypothesis that the PageValue has a strong effect on whether or not the shopper buys.

online_shopper_bought <- online_shoppers[which(online_shoppers$Revenue==TRUE),]
online_shopper_not <- online_shoppers[which(online_shoppers$Revenue==FALSE),]
t.test(online_shopper_bought$PageValues, online_shopper_not$PageValues, conf.level = .95)
## 
##  Welch Two Sample t-test
## 
## data:  online_shopper_bought$PageValues and online_shopper_not$PageValues
## t = 31, df = 1954, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  23.7 26.9
## sample estimates:
## mean of x mean of y 
##     27.26      1.98

Logistic Regression

Basic Logistic Regression with PageValues and the three types of pages.

unloadPkg("caret")
attach(online_shoppers)
ShopLogit <- glm(Revenue ~ PageValues + ProductRelated + Informational + Administrative + ExitRates, data = online_shoppers, family = "binomial")
summary(ShopLogit)
## 
## Call:
## glm(formula = Revenue ~ PageValues + ProductRelated + Informational + 
##     Administrative + ExitRates, family = "binomial", data = online_shoppers)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.997  -0.473  -0.387  -0.199   3.409  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.04e+00   6.62e-02  -30.79   <2e-16 ***
## PageValues      7.99e-02   2.32e-03   34.42   <2e-16 ***
## ProductRelated  5.16e-03   5.89e-04    8.77   <2e-16 ***
## Informational   3.52e-02   2.17e-02    1.62     0.10    
## Administrative  5.55e-04   9.21e-03    0.06     0.95    
## ExitRates      -1.89e+01   1.62e+00  -11.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10624.8  on 12329  degrees of freedom
## Residual deviance:  7448.3  on 12324  degrees of freedom
## AIC: 7460
## 
## Number of Fisher Scoring iterations: 6
#kabledply(confusionMatrix(actual=ShopLogit$y,predicted=ShopLogit$fitted.values), title = "Confusion Matrix")
confusionMatrix(actual=ShopLogit$y,predicted=ShopLogit$fitted.values)
##       [,1] [,2]
## [1,] 10192 1205
## [2,]   230  703
prob = predict(ShopLogit, type = "response")
h <- roc(Revenue~prob)
auc(h)
## Area under the curve: 0.882
plot(h)

ShopperNullLogit <- glm(Revenue ~ 1, family = "binomial")
mcFadden = 1 - logLik(ShopLogit)/logLik(ShopperNullLogit)
mcFadden
## 'log Lik.' 0.299 (df=6)
pR2(ShopLogit)
## fitting null model for pseudo-r2
##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -3724.131 -5312.398  3176.535     0.299     0.227     0.393
loadPkg("caret")

This model looks to be a good model already. The coefficients are all statistically significant aside from the Informational pages, and the AIC is 7675.8. The AUC is .8883, which is higher than 0.8, which is evidence that the predictors are doing well in classifying the dataset.

Let’s create testing and training data from Cross Validation to navigate around overfitting and possible variance in creating testing/training data

set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
ShopModelCV <- train(Revenue ~., data = online_shoppers, method = "glm", family = "binomial", trControl = train_control)
summary(ShopModelCV)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -6.142  -0.463  -0.324  -0.154   3.099  
## 
## Coefficients: (1 not defined because of singularities)
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.85e+00   2.17e-01   -8.55  < 2e-16 ***
## Administrative                3.04e-03   1.11e-02    0.27  0.78386    
## Administrative_Duration      -9.97e-05   1.95e-04   -0.51  0.60815    
## Informational                 3.16e-02   2.72e-02    1.16  0.24551    
## Informational_Duration        5.27e-05   2.24e-04    0.24  0.81402    
## ProductRelated                1.89e-03   1.17e-03    1.61  0.10661    
## ProductRelated_Duration       5.96e-05   2.73e-05    2.18  0.02931 *  
## BounceRates                  -1.69e+00   3.22e+00   -0.52  0.59988    
## ExitRates                    -1.53e+01   2.44e+00   -6.26  3.8e-10 ***
## PageValues                    8.19e-02   2.44e-03   33.60  < 2e-16 ***
## SpecialDay                   -9.51e-02   2.37e-01   -0.40  0.68837    
## MonthDec                     -7.11e-01   1.88e-01   -3.79  0.00015 ***
## MonthFeb                     -1.71e+00   6.43e-01   -2.65  0.00795 ** 
## MonthJul                      1.01e-01   2.20e-01    0.46  0.64620    
## MonthJune                    -3.37e-01   2.77e-01   -1.22  0.22410    
## MonthMar                     -5.82e-01   1.86e-01   -3.13  0.00174 ** 
## MonthMay                     -5.33e-01   1.76e-01   -3.03  0.00249 ** 
## MonthNov                      4.60e-01   1.68e-01    2.74  0.00611 ** 
## MonthOct                     -5.82e-02   2.05e-01   -0.28  0.77607    
## MonthSep                      1.58e-02   2.13e-01    0.07  0.94093    
## OperatingSystems2             1.55e-01   1.65e-01    0.94  0.34606    
## OperatingSystems3            -4.16e-02   1.77e-01   -0.23  0.81421    
## OperatingSystems4            -3.62e-02   1.81e-01   -0.20  0.84123    
## OperatingSystems5             3.56e-02   1.23e+00    0.03  0.97683    
## OperatingSystems6            -1.03e+00   9.47e-01   -1.09  0.27615    
## OperatingSystems7             1.17e+00   1.14e+00    1.02  0.30667    
## OperatingSystems8             3.03e-01   6.41e-01    0.47  0.63677    
## Browser2                     -1.20e-01   1.65e-01   -0.72  0.47022    
## Browser3                     -1.00e+00   5.71e-01   -1.76  0.07839 .  
## Browser4                     -5.01e-02   2.09e-01   -0.24  0.81046    
## Browser5                      8.74e-02   2.25e-01    0.39  0.69703    
## Browser6                     -4.06e-01   3.34e-01   -1.21  0.22489    
## Browser7                     -9.36e-02   5.06e-01   -0.18  0.85340    
## Browser8                      1.87e-01   3.18e-01    0.59  0.55625    
## Browser9                     -1.21e+01   1.46e+03   -0.01  0.99339    
## Browser10                     1.34e-01   2.95e-01    0.45  0.65082    
## Browser11                           NA         NA      NA       NA    
## Browser12                     1.79e+00   7.82e-01    2.30  0.02171 *  
## Browser13                     9.06e-02   9.89e-01    0.09  0.92700    
## Region2                       1.52e-01   1.12e-01    1.36  0.17377    
## Region3                      -9.74e-03   8.66e-02   -0.11  0.91050    
## Region4                      -5.19e-02   1.15e-01   -0.45  0.65106    
## Region5                      -2.33e-01   2.12e-01   -1.10  0.27021    
## Region6                       5.45e-02   1.34e-01    0.41  0.68362    
## Region7                      -4.42e-03   1.36e-01   -0.03  0.97397    
## Region8                       6.35e-02   1.75e-01    0.36  0.71727    
## Region9                      -2.68e-01   1.74e-01   -1.54  0.12349    
## TrafficType2                  1.84e-01   9.53e-02    1.93  0.05392 .  
## TrafficType3                 -2.31e-01   1.24e-01   -1.86  0.06315 .  
## TrafficType4                  4.72e-02   1.41e-01    0.34  0.73751    
## TrafficType5                  2.51e-01   2.15e-01    1.17  0.24270    
## TrafficType6                 -8.43e-02   1.99e-01   -0.42  0.67115    
## TrafficType7                  3.35e-01   4.77e-01    0.70  0.48263    
## TrafficType8                  6.13e-01   1.80e-01    3.41  0.00065 ***
## TrafficType9                  6.28e-03   6.80e-01    0.01  0.99263    
## TrafficType10                 3.83e-01   1.67e-01    2.30  0.02156 *  
## TrafficType11                 4.14e-01   2.20e-01    1.88  0.05961 .  
## TrafficType12                -1.20e+01   1.46e+03   -0.01  0.99342    
## TrafficType13                -5.28e-01   2.01e-01   -2.63  0.00855 ** 
## TrafficType14                -5.75e-01   1.13e+00   -0.51  0.61078    
## TrafficType15                -1.23e+01   2.19e+02   -0.06  0.95534    
## TrafficType16                 1.93e+00   1.24e+00    1.56  0.11907    
## TrafficType17                -1.17e+01   1.46e+03   -0.01  0.99356    
## TrafficType18                -1.25e+01   4.45e+02   -0.03  0.97764    
## TrafficType19                -1.15e+00   1.40e+00   -0.82  0.41143    
## TrafficType20                 4.82e-01   2.72e-01    1.77  0.07614 .  
## VisitorTypeOther             -6.56e-01   7.59e-01   -0.86  0.38737    
## VisitorTypeReturning_Visitor -2.11e-01   9.11e-02   -2.31  0.02083 *  
## WeekendTRUE                   8.86e-02   7.25e-02    1.22  0.22181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10624.8  on 12329  degrees of freedom
## Residual deviance:  7078.7  on 12262  degrees of freedom
## AIC: 7215
## 
## Number of Fisher Scoring iterations: 14
unloadPkg("caret")
xkabledply(confusionMatrix(actual=ShopModelCV$finalModel$y,predicted=ShopModelCV$finalModel$fitted.values), title = "Confusion Matrix")
Confusion Matrix
1 2
10176 1164
246 744
prob = predict(ShopModelCV$finalModel, type = "response")
h <- roc(Revenue~prob)
auc(h)
## Area under the curve: 0.9
plot(h)

ShopperNullLogit <- glm(Revenue ~ 1, family = "binomial")
mcFadden = 1 - logLik(ShopModelCV$finalModel)/logLik(ShopperNullLogit)
mcFadden
## 'log Lik.' 0.334 (df=68)
loadPkg("caret")

Let’s reduce the number of features in this model, particularly the less significant ones. Starting from the null model, let’s add the strongest predictors and move in both directions to achieve the best BIC.

full = ShopModelCV
null = glm(Revenue ~ 1, family = "binomial")
step(null, scope = list(lower=null, upper=full), direction = "both", criterion = "AIC")
## Start:  AIC=10627
## Revenue ~ 1
## 
##                           Df Deviance   AIC
## + PageValues               1     7889  7893
## + ExitRates                1     9666  9670
## + BounceRates              1    10088 10092
## + Month                    9    10244 10264
## + TrafficType             19    10235 10275
## + ProductRelated           1    10380 10384
## + ProductRelated_Duration  1    10391 10395
## + Administrative           1    10416 10420
## + VisitorType              2    10504 10510
## + SpecialDay               1    10518 10522
## + Informational            1    10531 10535
## + Administrative_Duration  1    10538 10542
## + OperatingSystems         7    10546 10562
## + Informational_Duration   1    10576 10580
## + Weekend                  1    10614 10618
## + Browser                 12    10595 10621
## <none>                          10625 10627
## + Region                   8    10615 10633
## 
## Step:  AIC=7893
## Revenue ~ PageValues
## 
##                           Df Deviance   AIC
## + Month                    9     7533  7555
## + ExitRates                1     7556  7562
## + BounceRates              1     7673  7679
## + ProductRelated           1     7679  7685
## + ProductRelated_Duration  1     7702  7708
## + TrafficType             19     7710  7752
## + Administrative           1     7804  7810
## + Informational            1     7834  7840
## + SpecialDay               1     7845  7851
## + Administrative_Duration  1     7853  7859
## + VisitorType              2     7855  7863
## + Informational_Duration   1     7858  7864
## + OperatingSystems         7     7866  7884
## + Weekend                  1     7879  7885
## <none>                           7889  7893
## + Browser                 12     7871  7899
## + Region                   8     7884  7904
## - PageValues               1    10625 10627
## 
## Step:  AIC=7555
## Revenue ~ PageValues + Month
## 
##                           Df Deviance   AIC
## + ExitRates                1     7240  7264
## + BounceRates              1     7333  7357
## + ProductRelated           1     7410  7434
## + ProductRelated_Duration  1     7418  7442
## + TrafficType             19     7408  7468
## + Administrative           1     7477  7501
## + Informational            1     7492  7516
## + VisitorType              2     7503  7529
## + Administrative_Duration  1     7507  7531
## + Informational_Duration   1     7509  7533
## + OperatingSystems         7     7505  7541
## + Weekend                  1     7527  7551
## + SpecialDay               1     7530  7554
## <none>                           7533  7555
## + Browser                 12     7516  7562
## + Region                   8     7526  7564
## - Month                    9     7889  7893
## - PageValues               1    10244 10264
## 
## Step:  AIC=7264
## Revenue ~ PageValues + Month + ExitRates
## 
##                           Df Deviance  AIC
## + ProductRelated_Duration  1     7186 7212
## + ProductRelated           1     7192 7218
## + TrafficType             19     7183 7245
## + Informational            1     7222 7248
## + Informational_Duration   1     7227 7253
## + Administrative           1     7231 7257
## + Administrative_Duration  1     7235 7261
## + OperatingSystems         7     7223 7261
## + VisitorType              2     7235 7263
## <none>                           7240 7264
## + Weekend                  1     7238 7264
## + BounceRates              1     7240 7266
## + SpecialDay               1     7240 7266
## + Browser                 12     7226 7274
## + Region                   8     7234 7274
## - ExitRates                1     7533 7555
## - Month                    9     7556 7562
## - PageValues               1     9369 9391
## 
## Step:  AIC=7212
## Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration
## 
##                           Df Deviance  AIC
## + TrafficType             19     7122 7186
## + VisitorType              2     7172 7202
## + Informational            1     7184 7212
## + Weekend                  1     7184 7212
## <none>                           7186 7212
## + BounceRates              1     7185 7213
## + ProductRelated           1     7185 7213
## + Informational_Duration   1     7185 7213
## + OperatingSystems         7     7173 7213
## + Administrative           1     7186 7214
## + SpecialDay               1     7186 7214
## + Administrative_Duration  1     7186 7214
## + Browser                 12     7170 7220
## + Region                   8     7180 7222
## - ProductRelated_Duration  1     7240 7264
## - ExitRates                1     7418 7442
## - Month                    9     7453 7461
## - PageValues               1     9331 9355
## 
## Step:  AIC=7186
## Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration + 
##     TrafficType
## 
##                           Df Deviance  AIC
## + VisitorType              2     7116 7184
## + ProductRelated           1     7119 7185
## <none>                           7122 7186
## + Informational            1     7120 7186
## + Weekend                  1     7120 7186
## + Informational_Duration   1     7121 7187
## + BounceRates              1     7121 7187
## + Administrative           1     7121 7187
## + Administrative_Duration  1     7121 7187
## + SpecialDay               1     7121 7187
## + OperatingSystems         7     7111 7189
## + Browser                 12     7106 7194
## + Region                   8     7115 7195
## - TrafficType             19     7186 7212
## - ProductRelated_Duration  1     7183 7245
## - ExitRates                1     7294 7356
## - Month                    9     7355 7401
## - PageValues               1     9220 9282
## 
## Step:  AIC=7184
## Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration + 
##     TrafficType + VisitorType
## 
##                           Df Deviance  AIC
## + ProductRelated           1     7113 7183
## <none>                           7116 7184
## + Informational            1     7114 7184
## + Weekend                  1     7115 7185
## + Informational_Duration   1     7115 7185
## + BounceRates              1     7116 7186
## - VisitorType              2     7122 7186
## + Administrative           1     7116 7186
## + Administrative_Duration  1     7116 7186
## + SpecialDay               1     7116 7186
## + OperatingSystems         7     7106 7188
## + Browser                 12     7101 7193
## + Region                   8     7109 7193
## - TrafficType             19     7172 7202
## - ProductRelated_Duration  1     7182 7248
## - ExitRates                1     7273 7339
## - Month                    9     7348 7398
## - PageValues               1     9209 9275
## 
## Step:  AIC=7183
## Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration + 
##     TrafficType + VisitorType + ProductRelated
## 
##                           Df Deviance  AIC
## <none>                           7113 7183
## + Informational            1     7111 7183
## + Weekend                  1     7112 7184
## + Informational_Duration   1     7112 7184
## - ProductRelated           1     7116 7184
## + BounceRates              1     7112 7184
## + SpecialDay               1     7113 7185
## + Administrative_Duration  1     7113 7185
## + Administrative           1     7113 7185
## - VisitorType              2     7119 7185
## + OperatingSystems         7     7103 7187
## - ProductRelated_Duration  1     7120 7188
## + Browser                 12     7098 7192
## + Region                   8     7106 7192
## - TrafficType             19     7170 7202
## - ExitRates                1     7258 7326
## - Month                    9     7336 7388
## - PageValues               1     9209 9277
## 
## Call:  glm(formula = Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration + 
##     TrafficType + VisitorType + ProductRelated, family = "binomial")
## 
## Coefficients:
##                  (Intercept)                    PageValues  
##                    -1.80e+00                      8.17e-02  
##                     MonthDec                      MonthFeb  
##                    -7.10e-01                     -1.78e+00  
##                     MonthJul                     MonthJune  
##                     1.07e-01                     -3.41e-01  
##                     MonthMar                      MonthMay  
##                    -5.73e-01                     -5.53e-01  
##                     MonthNov                      MonthOct  
##                     4.45e-01                     -6.32e-02  
##                     MonthSep                     ExitRates  
##                    -8.51e-03                     -1.63e+01  
##      ProductRelated_Duration                  TrafficType2  
##                     6.57e-05                      1.83e-01  
##                 TrafficType3                  TrafficType4  
##                    -2.51e-01                      4.32e-02  
##                 TrafficType5                  TrafficType6  
##                     2.52e-01                     -8.73e-02  
##                 TrafficType7                  TrafficType8  
##                     3.48e-01                      5.83e-01  
##                 TrafficType9                 TrafficType10  
##                    -3.11e-02                      3.50e-01  
##                TrafficType11                 TrafficType12  
##                     4.07e-01                     -1.19e+01  
##                TrafficType13                 TrafficType14  
##                    -6.02e-01                     -4.32e-01  
##                TrafficType15                 TrafficType16  
##                    -1.23e+01                      1.95e+00  
##                TrafficType17                 TrafficType18  
##                    -1.18e+01                     -1.25e+01  
##                TrafficType19                 TrafficType20  
##                    -1.14e+00                      4.58e-01  
##             VisitorTypeOther  VisitorTypeReturning_Visitor  
##                    -5.83e-01                     -2.14e-01  
##               ProductRelated  
##                     2.04e-03  
## 
## Degrees of Freedom: 12329 Total (i.e. Null);  12295 Residual
## Null Deviance:       10600 
## Residual Deviance: 7110  AIC: 7180

Now we have our full logistic model, which is glm(formula = Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration + TrafficType + VisitorType + ProductRelated, family = “binomial”)

Let’s run it through the same cross validation training/testing data.

ShopModelCV <- train(Revenue ~ PageValues + Month + ExitRates + ProductRelated_Duration + TrafficType + VisitorType + ProductRelated, data = online_shoppers, method = "glm", family = "binomial", trControl = train_control)
summary(ShopModelCV)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -6.146  -0.464  -0.328  -0.155   3.126  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -1.80e+00   1.94e-01   -9.27  < 2e-16 ***
## PageValues                    8.17e-02   2.40e-03   34.04  < 2e-16 ***
## MonthDec                     -7.10e-01   1.86e-01   -3.81  0.00014 ***
## MonthFeb                     -1.78e+00   6.44e-01   -2.76  0.00580 ** 
## MonthJul                      1.07e-01   2.19e-01    0.49  0.62534    
## MonthJune                    -3.41e-01   2.77e-01   -1.23  0.21798    
## MonthMar                     -5.73e-01   1.85e-01   -3.10  0.00194 ** 
## MonthMay                     -5.53e-01   1.71e-01   -3.23  0.00123 ** 
## MonthNov                      4.45e-01   1.67e-01    2.67  0.00758 ** 
## MonthOct                     -6.32e-02   2.04e-01   -0.31  0.75640    
## MonthSep                     -8.51e-03   2.13e-01   -0.04  0.96809    
## ExitRates                    -1.63e+01   1.67e+00   -9.77  < 2e-16 ***
## ProductRelated_Duration       6.57e-05   2.61e-05    2.52  0.01182 *  
## TrafficType2                  1.83e-01   9.39e-02    1.94  0.05195 .  
## TrafficType3                 -2.51e-01   1.23e-01   -2.04  0.04129 *  
## TrafficType4                  4.32e-02   1.39e-01    0.31  0.75672    
## TrafficType5                  2.52e-01   2.13e-01    1.18  0.23659    
## TrafficType6                 -8.73e-02   1.97e-01   -0.44  0.65807    
## TrafficType7                  3.48e-01   4.72e-01    0.74  0.46088    
## TrafficType8                  5.83e-01   1.77e-01    3.29  0.00101 ** 
## TrafficType9                 -3.11e-02   6.71e-01   -0.05  0.96303    
## TrafficType10                 3.50e-01   1.65e-01    2.12  0.03402 *  
## TrafficType11                 4.07e-01   2.10e-01    1.94  0.05230 .  
## TrafficType12                -1.19e+01   1.46e+03   -0.01  0.99348    
## TrafficType13                -6.02e-01   1.98e-01   -3.04  0.00235 ** 
## TrafficType14                -4.32e-01   1.07e+00   -0.40  0.68668    
## TrafficType15                -1.23e+01   2.21e+02   -0.06  0.95539    
## TrafficType16                 1.95e+00   1.24e+00    1.58  0.11520    
## TrafficType17                -1.18e+01   1.46e+03   -0.01  0.99354    
## TrafficType18                -1.25e+01   4.45e+02   -0.03  0.97758    
## TrafficType19                -1.14e+00   1.41e+00   -0.81  0.41803    
## TrafficType20                 4.58e-01   2.61e-01    1.76  0.07901 .  
## VisitorTypeOther             -5.83e-01   5.43e-01   -1.07  0.28311    
## VisitorTypeReturning_Visitor -2.14e-01   9.03e-02   -2.37  0.01789 *  
## ProductRelated                2.04e-03   1.12e-03    1.83  0.06794 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10624.8  on 12329  degrees of freedom
## Residual deviance:  7112.9  on 12295  degrees of freedom
## AIC: 7183
## 
## Number of Fisher Scoring iterations: 14

The new AIC of the model is 7182.9, which is a noticeable improvement on the original full model of 7214.7. Let’s apply it to the test data (4:1 on random seed 123) and see how it does.

unloadPkg("caret")
xkabledply(confusionMatrix(actual=ShopModelCV$finalModel$y,predicted=ShopModelCV$finalModel$fitted.values), title = "Confusion Matrix")
Confusion Matrix
1 2
10177 1168
245 740
prob = predict(ShopModelCV$finalModel, type = "response")
h <- roc(Revenue~prob)
auc(h)
## Area under the curve: 0.899
plot(h)

ShopperNullLogit <- glm(Revenue ~ 1, family = "binomial")
mcFadden = 1 - logLik(ShopModelCV$finalModel)/logLik(ShopperNullLogit)
mcFadden
## 'log Lik.' 0.331 (df=35)
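The McFadden pseudo-R² computation above generalizes to any pair of fitted and null logistic models; a minimal sketch on R's built-in mtcars data (a stand-in, not the shoppers data):

```r
# McFadden's pseudo-R^2: 1 - logLik(model) / logLik(null model).
fit  <- glm(am ~ mpg, data = mtcars, family = "binomial")
null <- glm(am ~ 1,   data = mtcars, family = "binomial")
mcfadden <- as.numeric(1 - logLik(fit) / logLik(null))
mcfadden   # between 0 and 1; higher means more deviance explained
```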

Finally, let’s move the cutoff value to get a high sensitivity without sacrificing too much accuracy. Let’s try lowering it to .2.
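To make the cutoff mechanics concrete, here is a hedged sketch on made-up fitted probabilities (all values synthetic): lowering the cutoff flags more sessions as positive, which raises sensitivity.

```r
# Synthetic fitted probabilities and labels, for illustration only.
probs  <- c(0.05, 0.15, 0.25, 0.35, 0.55, 0.75, 0.10, 0.30, 0.60, 0.90)
actual <- c(0,    0,    1,    0,    1,    1,    0,    1,    0,    1)
for (cut in c(0.5, 0.2)) {
  pred <- as.integer(probs >= cut)
  sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)
  acc  <- mean(pred == actual)
  cat(sprintf("cutoff %.1f: sensitivity %.2f, accuracy %.2f\n", cut, sens, acc))
}
```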

xkabledply(confusionMatrix(actual=ShopModelCV$finalModel$y,predicted=ShopModelCV$finalModel$fitted.values, cutoff=.2), title = "Confusion Matrix")
Confusion Matrix
             Actual 1   Actual 2
Predicted 1      9331        561
Predicted 2      1091       1347
prob = predict(ShopModelCV$finalModel, type = "response")
h <- roc(Revenue~prob)
auc(h)
## Area under the curve: 0.899
plot(h)

ShopperNullLogit <- glm(Revenue ~ 1, family = "binomial")
mcFadden = 1 - logLik(ShopModelCV$finalModel)/logLik(ShopperNullLogit)
mcFadden
## 'log Lik.' 0.331 (df=35)

Interpretation

All of the models had generally high accuracies and AUCs, but I would choose the final model because it has the lowest AIC of the three. While the full model's McFadden pseudo-R² and AUC were slightly higher than the final model's, the final model is far less complex, using only 7 of the 17 available features.

Classification Tree and Oversampling

#First, load the file
df_orig = data.frame(read.csv("online_shoppers_intention.csv"))
head(df_orig)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                       0             0                      0
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1              1                    0.00      0.2000    0.2000          0
## 2              2                   64.00      0.0000    0.1000          0
## 3              1                    0.00      0.2000    0.2000          0
## 4              2                    2.67      0.0500    0.1400          0
## 5             10                  627.50      0.0200    0.0500          0
## 6             19                  154.22      0.0158    0.0246          0
##   SpecialDay Month OperatingSystems Browser Region TrafficType
## 1          0   Feb                1       1      1           1
## 2          0   Feb                2       2      1           2
## 3          0   Feb                4       1      9           3
## 4          0   Feb                3       2      2           4
## 5          0   Feb                3       3      1           4
## 6          0   Feb                2       2      1           3
##         VisitorType Weekend Revenue
## 1 Returning_Visitor   FALSE   FALSE
## 2 Returning_Visitor   FALSE   FALSE
## 3 Returning_Visitor   FALSE   FALSE
## 4 Returning_Visitor   FALSE   FALSE
## 5 Returning_Visitor    TRUE   FALSE
## 6 Returning_Visitor   FALSE   FALSE
colSums(is.na(df_orig))
##          Administrative Administrative_Duration           Informational 
##                       0                       0                       0 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                       0                       0                       0 
##             BounceRates               ExitRates              PageValues 
##                       0                       0                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0

No null values in the dataset.

table(df_orig$Revenue)
## 
## FALSE  TRUE 
## 10422  1908

We can see there is an imbalance in the dataset: TRUE values make up less than 20% of observations, compared to FALSE.
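The imbalance can be expressed as proportions directly from the counts printed above:

```r
# Class proportions, computed from the table shown above.
counts <- c(`FALSE` = 10422, `TRUE` = 1908)
round(prop.table(counts), 3)   # TRUE sessions are roughly 15.5% of the data
```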

Factoring categorical variables

df = data.frame(df_orig)

df$VisitorType = factor(df$VisitorType)
df$Weekend = factor(df$Weekend)
df$Revenue = factor(df$Revenue)
df$Month = factor(df$Month)

df[, c('VisitorType', 'Weekend', 'Revenue', 'Month')] <- sapply(df[, c('VisitorType', 'Weekend', 'Revenue', 'Month')], unclass)

df$VisitorType = factor(df$VisitorType)
df$Weekend = factor(df$Weekend)
df$Revenue = factor(df$Revenue)
df$Month = factor(df$Month)

head(df)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                       0             0                      0
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1              1                    0.00      0.2000    0.2000          0
## 2              2                   64.00      0.0000    0.1000          0
## 3              1                    0.00      0.2000    0.2000          0
## 4              2                    2.67      0.0500    0.1400          0
## 5             10                  627.50      0.0200    0.0500          0
## 6             19                  154.22      0.0158    0.0246          0
##   SpecialDay Month OperatingSystems Browser Region TrafficType VisitorType
## 1          0     3                1       1      1           1           3
## 2          0     3                2       2      1           2           3
## 3          0     3                4       1      9           3           3
## 4          0     3                3       2      2           4           3
## 5          0     3                3       3      1           4           3
## 6          0     3                2       2      1           3           3
##   Weekend Revenue
## 1       1       1
## 2       1       1
## 3       1       1
## 4       1       1
## 5       2       1
## 6       1       1
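Why "Feb" shows up as 3 in the Month column above: factor() orders levels alphabetically by default, and unclass() replaces each level with its position. A small sketch, where the ten month labels are assumed from this dataset's Month values:

```r
# Assumed month labels for this dataset (no Jan or Apr; "June" spelled out).
month_labels <- c("Feb","Mar","May","June","Jul","Aug","Sep","Oct","Nov","Dec")
sort(month_labels)                                        # alphabetical level order
as.integer(factor("Feb", levels = sort(month_labels)))    # position 3
```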

EDA for last 6 variables

ggplot(df_orig, aes(x=OperatingSystems, color=Revenue, fill=Revenue)) +
  # geom_histogram(bins = 8, fill="pink") + 
  # geom_freqpoly(bins=8, color="red") +
  geom_histogram(binwidth = 0.5, alpha=0.6, position = 'identity') +
  scale_color_viridis(discrete=TRUE) +
  scale_fill_viridis(discrete=TRUE) +
  scale_x_continuous(breaks=1:8) + scale_y_continuous(n.breaks=6) +
  theme_ipsum()  

ggplot(df_orig, aes(x=Browser, color=Revenue, fill=Revenue)) +
  geom_histogram(binwidth = 0.5, alpha=0.6, position = 'identity') +
  scale_x_continuous(breaks=1:13, ) + scale_y_continuous(n.breaks=6) +
  theme_ipsum() 

ggplot(df_orig, aes(x=TrafficType, color=Revenue, fill=Revenue)) +
  geom_histogram(binwidth = 1, alpha=0.6) +
  scale_fill_manual(values=c("orange", "blue")) +
  scale_x_continuous(n.breaks=20) + scale_y_continuous(n.breaks=6) + 
  theme_ipsum()  

ggplot(df_orig, aes(x=VisitorType, color=Revenue, fill=Revenue)) +
  geom_histogram(alpha=0.6, stat="count") +
  scale_fill_manual(values=c("pink", "green")) + scale_color_manual(values=c("pink", "green")) +
  theme_ipsum()  

ggplot(df_orig, aes(x=Weekend, color=Revenue, fill=Revenue)) +
  geom_histogram(alpha=0.6, stat="count") +
  scale_fill_manual(values=c("cyan", "gray")) + scale_color_manual(values=c("cyan", "gray")) +
  theme_ipsum() 

ggplot(df_orig, aes(x=Region, color=Revenue, fill=Revenue)) +
  geom_histogram(binwidth = 1, alpha=0.6, stat="count") + stat_bin(center=1) +
  scale_x_continuous(breaks=1:13, ) + scale_y_continuous(n.breaks=6) + 
  theme_ipsum() 

Plotting highlights the imbalance between the positive and negative classes of our response variable (Revenue). The distributions are also right-skewed.

Inferential Statistics

We will perform a chi-square test of independence between Revenue and each of these last 6 categorical variables.

os_contable = table(df_orig$Revenue, df_orig$OperatingSystems)
os_contable
##        
##            1    2    3    4    5    6    7    8
##   FALSE 2206 5446 2287  393    5   17    6   62
##   TRUE   379 1155  268   85    1    2    1   17
chisq.test(os_contable, simulate.p.value=T)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  os_contable
## X-squared = 75, df = NA, p-value = 5e-04
br_contable = table(df_orig$Revenue, df_orig$Browser)
br_contable
##        
##            1    2    3    4    5    6    7    8    9   10   11   12   13
##   FALSE 2097 6738  100  606  381  154   43  114    1  131    5    7   45
##   TRUE   365 1223    5  130   86   20    6   21    0   32    1    3   16
chisq.test(br_contable, simulate.p.value=T)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  br_contable
## X-squared = 28, df = NA, p-value = 0.008
rg_contable = table(df_orig$Revenue, df_orig$Region)
rg_contable
##        
##            1    2    3    4    5    6    7    8    9
##   FALSE 4009  948 2054 1007  266  693  642  378  425
##   TRUE   771  188  349  175   52  112  119   56   86
chisq.test(rg_contable, simulate.p.value=F)
## 
##  Pearson's Chi-squared test
## 
## data:  rg_contable
## X-squared = 9, df = 8, p-value = 0.3
Tt_contable = table(df_orig$Revenue, df_orig$TrafficType)
Tt_contable
##        
##            1    2    3    4    5    6    7    8    9   10   11   12   13   14
##   FALSE 2189 3066 1872  904  204  391   28  248   38  360  200    1  695   11
##   TRUE   262  847  180  165   56   53   12   95    4   90   47    0   43    2
##        
##           15   16   17   18   19   20
##   FALSE   38    2    1   10   16  148
##   TRUE     0    1    0    0    1   50
chisq.test(Tt_contable, simulate.p.value=T)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  Tt_contable
## X-squared = 373, df = NA, p-value = 5e-04
Vt_contable = table(df_orig$Revenue, df_orig$VisitorType)
Vt_contable
##        
##         New_Visitor Other Returning_Visitor
##   FALSE        1272    69              9081
##   TRUE          422    16              1470
chisq.test(Vt_contable, simulate.p.value=F)
## 
##  Pearson's Chi-squared test
## 
## data:  Vt_contable
## X-squared = 135, df = 2, p-value <2e-16
wk_contable = table(df_orig$Revenue, df_orig$Weekend)
wk_contable
##        
##         FALSE TRUE
##   FALSE  8053 2369
##   TRUE   1409  499
chisq.test(wk_contable, simulate.p.value=F)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  wk_contable
## X-squared = 10, df = 1, p-value = 0.001

From the tests above, all of the variables except Region are statistically significant (p-value < 0.05). Region has a p-value of 0.3 (> 0.05), making it statistically insignificant, so we will omit the Region variable from our modeling.
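The mechanics of these tests can be reproduced from the printed counts alone; for example, rebuilding the Weekend contingency table:

```r
# Rebuild the Weekend table from the counts above and re-run the test.
wk <- matrix(c(8053, 1409, 2369, 499), nrow = 2,
             dimnames = list(Revenue = c("FALSE", "TRUE"),
                             Weekend = c("FALSE", "TRUE")))
res <- chisq.test(wk)   # 2x2, so Yates' continuity correction is applied
res$p.value             # small, matching the significant result above
```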

set.seed(1234)
df_sample <- sample(2, nrow(df), replace=TRUE, prob=c(0.80, 0.20))

train <- df[df_sample==1, ]
# test_X <- df[df_sample==2, 1:17]
# 
# train_Y <- df[df_sample==1, 18]
test <- df[df_sample==2, ]
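A quick sanity check of the sampling mechanism behind this split, on a dummy index rather than the real data:

```r
# sample() with prob = c(0.80, 0.20) assigns each row independently,
# so the split is only approximately 80/20.
set.seed(1234)
idx <- sample(2, 10000, replace = TRUE, prob = c(0.80, 0.20))
round(prop.table(table(idx)), 2)
```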

Classification Tree Modeling and Hyperparameter Tuning

We will tune the "maxdepth" parameter of our decision tree. For this step we use only the original, unbalanced dataset.

loadPkg("rpart")
loadPkg("caret")
confusionMatrixResultDf = data.frame( Depth=numeric(0), Accuracy= numeric(0), Sensitivity=numeric(0), Specificity=numeric(0), Pos.Pred.Value=numeric(0), Neg.Pred.Value=numeric(0), Precision=numeric(0), Recall=numeric(0), F1=numeric(0), Prevalence=numeric(0), Detection.Rate=numeric(0), Detection.Prevalence=numeric(0), Balanced.Accuracy=numeric(0), row.names = NULL )

for (deep in 2:7) {
  tree_fit <- rpart(Revenue ~ . - Region, data=train, method="class", control = list(maxdepth = deep, cp=0.005))
  # 
  cm = confusionMatrix(predict(tree_fit, test, type = "class"), reference = test$Revenue, positive = "2") # from caret library
  # 
  cmaccu = cm$overall['Accuracy']
  # print( paste("Total Accuracy = ", cmaccu ) )
  # 
  cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics 
  cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
  confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
  # print("Other metrics : ")
}

xkabledply(confusionMatrixResultDf, title="Online Shoppers classification tree summary with varying maxdepth")
Online Shoppers classification tree summary with varying maxdepth
Depth Accuracy Sensitivity Specificity Pos.Pred.Value Neg.Pred.Value Precision Recall F1 Prevalence Detection.Rate Detection.Prevalence Balanced.Accuracy
2 0.876 0.387 0.972 0.724 0.891 0.724 0.387 0.504 0.163 0.0628 0.0867 0.679
3 0.881 0.571 0.941 0.652 0.919 0.652 0.571 0.609 0.163 0.0928 0.1422 0.756
4 0.886 0.539 0.954 0.695 0.914 0.695 0.539 0.607 0.163 0.0875 0.1260 0.746
5 0.885 0.584 0.944 0.669 0.921 0.669 0.584 0.623 0.163 0.0948 0.1418 0.764
6 0.889 0.546 0.956 0.704 0.916 0.704 0.546 0.615 0.163 0.0887 0.1260 0.751
7 0.889 0.546 0.956 0.704 0.916 0.704 0.546 0.615 0.163 0.0887 0.1260 0.751

A maxdepth of 6 gives the highest accuracy and precision, so we will use maxdepth = 6 for further evaluation.
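The rpart call used in the tuning loop can be exercised on its own; a minimal sketch on the small kyphosis dataset bundled with the rpart package:

```r
library(rpart)  # kyphosis ships with rpart
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
             control = rpart.control(maxdepth = 2, cp = 0.005))
pred <- predict(fit, kyphosis, type = "class")
table(actual = kyphosis$Kyphosis, predicted = pred)
```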

dtree_model <- rpart(Revenue ~ . - Region, data=train, method="class", control = list(maxdepth = 6, cp=0.005) )
confusionMatrix( predict(dtree_model, test, type = "class"), reference = test[, "Revenue"], positive = "2")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2
##          1 1975  182
##          2   92  219
##                                         
##                Accuracy : 0.889         
##                  95% CI : (0.876, 0.901)
##     No Information Rate : 0.838         
##     P-Value [Acc > NIR] : 2.25e-13      
##                                         
##                   Kappa : 0.552         
##                                         
##  Mcnemar's Test P-Value : 7.59e-08      
##                                         
##             Sensitivity : 0.5461        
##             Specificity : 0.9555        
##          Pos Pred Value : 0.7042        
##          Neg Pred Value : 0.9156        
##              Prevalence : 0.1625        
##          Detection Rate : 0.0887        
##    Detection Prevalence : 0.1260        
##       Balanced Accuracy : 0.7508        
##                                         
##        'Positive' Class : 2             
## 

We get 88.9% accuracy for the above tree model, but sensitivity (recall) is low because of the large number of false negatives (182).

plotcp(dtree_model)

The graph of relative error vs. depth/cp above shows that the error saturates beyond maxdepth = 6.

Plotting our tree

loadPkg("rattle") # For fancyRpartPlot (Trees) Answer "no" on installing from binary source
fancyRpartPlot(dtree_model)

library(pROC)

prob = predict(dtree_model, test, type = "class")
# df$prob=prob
h <- roc(test$Revenue ~ as.numeric(prob))
auc(h) # area-under-curve prefer 0.8 or higher.
## Area under the curve: 0.751
plot(h)

The ROC-AUC analysis above shows that the imbalanced dataset does not yield a good AUC score (< 0.8), because specificity is much higher than sensitivity.
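For reference, the roc()/auc() calls behave the same way in isolation; a sketch with the pROC package on synthetic scores (all values made up):

```r
library(pROC)
set.seed(42)
score <- runif(500)
label <- rbinom(500, 1, score)     # labels correlated with the score
r <- roc(label, score, quiet = TRUE)
auc(r)                             # well above 0.5, since the score is informative
```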

Oversampling dataset

Now we will address the imbalance in our dataset. TRUE observations of Revenue are far fewer than FALSE, so we will oversample the TRUE class until it matches the FALSE class.
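ovun.sample() handles this for us; the core idea can be sketched in base R by resampling the minority class with replacement (toy labels, not the real data):

```r
# Oversampling by hand: resample the minority class with replacement.
set.seed(1)
y <- rep(c("FALSE", "TRUE"), c(90, 10))           # imbalanced toy labels
need <- sum(y == "FALSE") - sum(y == "TRUE")      # 80 extra TRUEs needed
extra <- sample(which(y == "TRUE"), need, replace = TRUE)
y_bal <- c(y, y[extra])
table(y_bal)                                      # 90 FALSE, 90 TRUE
```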

loadPkg("ROSE") # ovun.sample comes from the ROSE package
train_os = data.frame(train)
train_new <- ovun.sample(Revenue ~ ., data = train_os, method = "over", N = 16710)$data
table(train_new$Revenue)
## 
##    1    2 
## 8355 8355

The table above shows that the TRUE and FALSE observations are now equal.

Modeling and Tuning on the new balanced dataset

confusionMatrixResultDf = data.frame( Depth=numeric(0), Accuracy= numeric(0), Sensitivity=numeric(0), Specificity=numeric(0), Pos.Pred.Value=numeric(0), Neg.Pred.Value=numeric(0), Precision=numeric(0), Recall=numeric(0), F1=numeric(0), Prevalence=numeric(0), Detection.Rate=numeric(0), Detection.Prevalence=numeric(0), Balanced.Accuracy=numeric(0), row.names = NULL )

for (deep in 2:10) {
  tree_fit <- rpart(Revenue ~ . - Region, data=train_new, method="class", control = list(maxdepth = deep, cp=0.001) )
  # 
  cm = confusionMatrix( predict(tree_fit, test, type = "class"), reference = test[, "Revenue"], positive = "2") # from caret library
  # 
  cmaccu = cm$overall['Accuracy']
  # print( paste("Total Accuracy = ", cmaccu ) )
  # 
  cmt = data.frame(Depth=deep, Accuracy = cmaccu, row.names = NULL ) # initialize a row of the metrics 
  cmt = cbind( cmt, data.frame( t(cm$byClass) ) ) # the dataframe of the transpose, with k valued added in front
  confusionMatrixResultDf = rbind(confusionMatrixResultDf, cmt)
  # print("Other metrics : ")
}

xkabledply(confusionMatrixResultDf, title="Online Shoppers classification tree summary with varying MaxDepth")
Online Shoppers classification tree summary with varying MaxDepth
Depth Accuracy Sensitivity Specificity Pos.Pred.Value Neg.Pred.Value Precision Recall F1 Prevalence Detection.Rate Detection.Prevalence Balanced.Accuracy
2 0.857 0.748 0.879 0.544 0.947 0.544 0.748 0.630 0.163 0.122 0.223 0.813
3 0.838 0.785 0.848 0.501 0.953 0.501 0.785 0.612 0.163 0.128 0.255 0.817
4 0.841 0.785 0.852 0.507 0.953 0.507 0.785 0.616 0.163 0.128 0.252 0.819
5 0.842 0.798 0.851 0.510 0.956 0.510 0.798 0.622 0.163 0.130 0.255 0.825
6 0.842 0.803 0.849 0.508 0.957 0.508 0.803 0.622 0.163 0.131 0.257 0.826
7 0.836 0.805 0.842 0.498 0.957 0.498 0.805 0.615 0.163 0.131 0.263 0.824
8 0.841 0.788 0.851 0.506 0.954 0.506 0.788 0.617 0.163 0.128 0.253 0.820
9 0.842 0.788 0.853 0.510 0.954 0.510 0.788 0.619 0.163 0.128 0.251 0.821
10 0.842 0.788 0.852 0.509 0.954 0.509 0.788 0.618 0.163 0.128 0.252 0.820
With the balanced training data, a maxdepth of 2 already gives the highest test accuracy, so we refit with maxdepth = 2.

dtree_model <- rpart(Revenue ~ . - Region, data=train_new, method="class", control = list(maxdepth = 2, cp=0.001) )

confusionMatrix(predict(dtree_model, test, type = "class"), reference = test$Revenue, positive = "2")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2
##          1 1816  101
##          2  251  300
##                                         
##                Accuracy : 0.857         
##                  95% CI : (0.843, 0.871)
##     No Information Rate : 0.838         
##     P-Value [Acc > NIR] : 0.00362       
##                                         
##                   Kappa : 0.545         
##                                         
##  Mcnemar's Test P-Value : 1.99e-15      
##                                         
##             Sensitivity : 0.748         
##             Specificity : 0.879         
##          Pos Pred Value : 0.544         
##          Neg Pred Value : 0.947         
##              Prevalence : 0.162         
##          Detection Rate : 0.122         
##    Detection Prevalence : 0.223         
##       Balanced Accuracy : 0.813         
##                                         
##        'Positive' Class : 2             
## 

After oversampling, our accuracy is 85.7%, which is lower than that of the previous model trained on the unbalanced dataset.
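Checking the trade-off by hand from the confusion matrix above: overall accuracy falls, but the class-wise performance is far better balanced.

```r
# Counts taken from the balanced-data tree's confusion matrix above.
TP <- 300; FN <- 101; TN <- 1816; FP <- 251
acc  <- (TP + TN) / (TP + TN + FP + FN)   # 0.857, overall accuracy
sens <- TP / (TP + FN)                    # 0.748
spec <- TN / (TN + FP)                    # 0.879
(sens + spec) / 2                         # 0.813, the balanced accuracy
```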

prob = predict(dtree_model, test, type = "class")
# df$prob=prob
h <- roc(test$Revenue ~ as.numeric(prob))
auc(h) # area-under-curve prefer 0.8 or higher.
## Area under the curve: 0.813
plot(h)

Oversampling increases sensitivity because the number of false negatives drops (from 182 to 101). The AUC score is now 0.813 (> 0.8).

fancyRpartPlot(dtree_model)

unloadPkg("caret")

Aaron's EDA and KNN Model

#First, load the file
online = data.frame(read.csv("online_shoppers_intention.csv"))
# We will use this dataframe later for model manipulation
original_online = data.frame(read.csv("online_shoppers_intention.csv"))
total_observations = nrow(online)
total_observations
head(online)
# Handling categorical variables prior to plotting
online$Month = as.factor(online$Month)
online$Revenue = as.factor(online$Revenue)
online$Weekend = as.factor(online$Weekend)
online$VisitorType = as.factor(online$VisitorType)

# EDA of Barplots and Density

monthFrequency = ggplot(online, aes(x=Month, color=Month, fill=Month)) + geom_bar() + labs(title="Count of Months",
        x ="Month", y = "Count")
monthFrequency

userCount = ggplot(online, aes(x=VisitorType, color=VisitorType, fill=VisitorType)) + geom_bar() + labs(title="Visitor Type Count",
        x ="Visitor Types", y = "Count")
userCount

revCount = ggplot(online, aes(x=Revenue, color=Revenue, fill=Revenue)) + geom_bar() + labs(title="Revenue Count",
        x ="Revenue", y = "Count")
revCount

revTrue = subset(online, online$Revenue == TRUE)
revFalse = subset(online, online$Revenue == FALSE)
pieNum <- c(nrow(revTrue), nrow(revFalse))
piepercent<- round(100*pieNum/sum(pieNum), 1)
pie(pieNum, labels = piepercent, main = "Revenue Percentages",col = rainbow(length(pieNum)))
legend("topright", c("True","False"), cex = 0.8, fill = rainbow(length(pieNum)))

# Special Day data
specialDay = ggplot(online, aes(SpecialDay)) + geom_density(color="darkblue", fill="lightblue") + labs(title="Density Chart of Special Day",x ="Special Day", y = "Density")
specialDay

# scatterSpecialDay = ggplot(online, aes(Revenue, SpecialDay)) + geom_violin(), this plot had no significance

# Histograms of Individual Pages of the E-commerce Site

admin = ggplot(online, aes(Administrative, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Administrative", x ="Administrative", y = "Count")
admin

info = ggplot(online, aes(Informational, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Informational", x ="Informational", y = "Count")
info

prod = ggplot(online, aes(ProductRelated, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Product Related", x ="Product Related", y = "Count")
prod

#
# Duration of the above Pages
#

adminDur = ggplot(online, aes(Administrative_Duration, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Administrative Duration", x ="Administrative Duration", y = "Count")
adminDur

infoDur = ggplot(online, aes(Informational_Duration, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Informational Duration", x ="Informational Duration", y = "Count")
infoDur

prodDur = ggplot(online, aes(ProductRelated_Duration, color=Revenue, fill=Revenue)) + geom_histogram(bins=30) + labs(title="Count of Product Related Duration", x ="Product Related Duration", y = "Count")
prodDur

# Boxplots of User Bounce and Exit Rates, Page Value Plots of Revenue and No Revenue

# Bounce Rate is measured by total one page visits divided by total visits 
bounceBox = ggplot(online, aes(BounceRates)) + geom_boxplot(color='blue') + labs(title="Boxplot of Bounce Rate", x ="Bounce Rate")
bounceBox

# Exit Rate is measured by total exits from page divided by total visits to page
exitBox = ggplot(online, aes(ExitRates)) + geom_boxplot(color='orange') + labs(title="Boxplot of Exit Rate", x ="Exit Rate")
exitBox

# Page Value is the average value for a page that a user visited before landing on the goal page or completing an Ecommerce transaction
pageBox_rev = ggplot(revTrue, aes(PageValues)) + geom_boxplot(color='green') + labs(title="Boxplot of Page Values with Revenue", x ="Page Values")
pageBox_rev

pageBox_no_rev = ggplot(revFalse, aes(PageValues)) + geom_boxplot(color='red') + labs(title="Boxplot of Page Values of No Revenue Rows", x ="Page Values")
pageBox_no_rev

The average Bounce Rate for this site is 2.219%, and the average Exit Rate is 4.307%. Sessions that generated revenue had an average Page Value of 27.265, while sessions with no revenue averaged 1.976. For our dataset it would be better to see exit and bounce rates for individual pages; the ExitRates and BounceRates columns are averages of these metrics over the pages a visitor viewed on this site.

# Corrplot of Variables

# Corrplot after changing categorical to numerical
online_new = subset(online, select = -c(VisitorType, Month))
online_new$Weekend = as.numeric(online_new$Weekend)
online_new$Revenue = as.numeric(online_new$Revenue)
corrOnline <- cor(online_new)
corrplot(corrOnline, method="number",number.cex=0.5,tl.cex=.6)

# Chi-Squared Test for Months and Special Day against Revenue

# Table for Month and Revenue of that Month
monthTable = xtabs( ~ Revenue + Month, data = online)
monthTable
chisqMonth = chisq.test(monthTable)
chisqMonth
# Table for Special Day and Revenue as we get closer to that Day
# This value represents the closeness of the browsing date to special days or holidays
specialDayTable = xtabs( ~ Revenue + SpecialDay, data = online)
specialDayTable
chisqSD = chisq.test(specialDayTable)
chisqSD
library(tidyr)
# Rehandle variables for knn model
online$VisitorType = as.factor(online$VisitorType)
online$Weekend = as.integer(online$Weekend)
online$OperatingSystems = as.integer(online$OperatingSystems)
online$Browser = as.integer(online$Browser)
online$TrafficType = as.integer(online$TrafficType)
online$Region = as.integer(online$Region)
online$Month = as.integer(factor(online$Month, levels = month.abb))
online = drop_na(online) # drop rows whose Month did not match month.abb (this dataset spells "June" in full, so those rows become NA)
# Making Visitor Type into numerical variable
library(tidyverse)
online$VisitorType <- str_replace(online$VisitorType,'New_Visitor','1')
online$VisitorType <- str_replace(online$VisitorType,'Other','2')
online$VisitorType <- str_replace(online$VisitorType,'Returning_Visitor','3')
online$VisitorType = as.numeric(online$VisitorType)
# handling the variables for the specific variables
# PageValues + Month + ExitRates + ProductRelated_Duration + TrafficType + VisitorType + ProductRelated
original_online$VisitorType = as.factor(original_online$VisitorType)
original_online$Weekend = as.integer(original_online$Weekend)
original_online$OperatingSystems = as.integer(original_online$OperatingSystems)
original_online$Browser = as.integer(original_online$Browser)
original_online$TrafficType = as.integer(original_online$TrafficType)
original_online$Region = as.integer(original_online$Region)
original_online$Month = as.integer(factor(original_online$Month, levels = month.abb))
original_online = drop_na(original_online)
# Making Visitor Type into numerical variable

original_online$VisitorType <- str_replace(original_online$VisitorType,'New_Visitor','1')
original_online$VisitorType <- str_replace(original_online$VisitorType,'Other','2')
original_online$VisitorType <- str_replace(original_online$VisitorType,'Returning_Visitor','3')
original_online$VisitorType = as.numeric(original_online$VisitorType)

original_online = subset(original_online, select = -c(1,2,3,4,7,10,12,13,14,17))

# Splitting Data and Training Data

scaledOnline <- as.data.frame(scale(online[1:17], center = TRUE, scale = TRUE))
# Selected Variables based off EDA
scaled_Online_select <- as.data.frame(scale(original_online[1:7], center = TRUE, scale = TRUE))
set.seed(123)
online_sample <- sample(2, nrow(scaledOnline), replace=TRUE, prob=c(0.80, 0.20))
# Selected Variables Model based off EDA
set.seed(123)
online_sample_selected <- sample(2, nrow(scaled_Online_select), replace=TRUE, prob=c(0.8, 0.2))
online_training <- scaledOnline[online_sample==1, 1:17]
online_test <- scaledOnline[online_sample==2, 1:17]
# Selected Variables Model based off EDA
online_training_selected <- scaled_Online_select[online_sample_selected==1, 1:7]
online_test_selected <- scaled_Online_select[online_sample_selected==2, 1:7]
online_trainLabels <- online[online_sample==1, 18]
online_testLabels <- online[online_sample==2, 18]
# Selected Variables Model based off EDA
online_trainLabels_selected <- original_online[online_sample_selected==1, 8]
online_testLabels_selected <- original_online[online_sample_selected==2, 8]
ResultDf = data.frame( k=numeric(0), Total.Accuracy= numeric(0), row.names = NULL )

ResultDf_selected = data.frame( k=numeric(0), Total.Accuracy= numeric(0), row.names = NULL )

# KNN Models both Full Model and Specific Variable

library(class)
library(gmodels)
for (kval in 5:25) {
  online_pred <- knn(train = online_training, test = online_test, cl=online_trainLabels, k=kval)
  onlinePREDCross <- CrossTable(online_testLabels, online_pred, prop.chisq = FALSE)
  print( paste("k = ", kval) )
  onlinePREDCross
  library(caret)
  cm = confusionMatrix(online_pred, reference = online_testLabels ) 
  cmaccu = cm$overall['Accuracy']
  print( paste("Total Accuracy = ", cmaccu ) )
   
  cmt = data.frame(k=kval, Total.Accuracy = cmaccu, row.names = NULL )  
  ResultDf = rbind(ResultDf, cmt)
  print( xkabledply(   as.matrix(cm), title = paste("ConfusionMatrix for k = ",kval ) ) )
  print( xkabledply(data.frame(cm$byClass), title=paste("k = ",kval)) )
}
# Selected Variable Model
for (kval in 10:30) {
  online_pred_selected <- knn(train = online_training_selected, test = online_test_selected, cl=online_trainLabels_selected, k=kval)
  onlinePREDCross_selected <- CrossTable(online_testLabels_selected, online_pred_selected, prop.chisq = FALSE)
  print( paste("k = ", kval) )
  onlinePREDCross_selected
  
  cm_selected = confusionMatrix(online_pred_selected, reference = online_testLabels_selected ) 
  cmaccu_selected = cm_selected$overall['Accuracy']
  print( paste("Total Accuracy = ", cmaccu_selected ) )
   
  cmt_selected = data.frame(k=kval, Total.Accuracy = cmaccu_selected, row.names = NULL )  
  ResultDf_selected = rbind(ResultDf_selected, cmt_selected)
  print( xkabledply(   as.matrix(cm_selected), title = paste("ConfusionMatrix for k = ",kval ) ) )
  print( xkabledply(data.frame(cm_selected$byClass), title=paste("k = ",kval)) )
}
xkabledply(ResultDf, "Total Accuracy Summary")
Total Accuracy Summary
k Total.Accuracy
5 0.861
6 0.860
7 0.863
8 0.862
9 0.862
10 0.862
11 0.863
12 0.862
13 0.865
14 0.864
15 0.867
16 0.865
17 0.867
18 0.867
19 0.867
20 0.868
21 0.865
22 0.865
23 0.867
24 0.865
25 0.866
# Selected Variable Model
xkabledply(ResultDf_selected, "Total Accuracy Summary")
Total Accuracy Summary
k Total.Accuracy
10 0.887
11 0.886
12 0.887
13 0.885
14 0.884
15 0.886
16 0.883
17 0.886
18 0.887
19 0.887
20 0.887
21 0.887
22 0.888
23 0.885
24 0.885
25 0.887
26 0.888
27 0.886
28 0.887
29 0.887
30 0.884

# Accuracy Plot

kPlot = ggplot(ResultDf, aes(k, Total.Accuracy)) + geom_point(colour = "blue", size = 3) + geom_line(color='orange')
kPlot

kPlot_selected = ggplot(ResultDf_selected, aes(k, Total.Accuracy)) + geom_point(colour = "blue", size = 3) + geom_line(color='orange')
kPlot_selected

# KNN Model with Optimal K Value

# Max Accuracy
online_pred_max <- knn(train = online_training, test = online_test, cl=online_trainLabels, k=15)
onlinePREDCross_max <- CrossTable(online_testLabels, online_pred_max, prop.chisq = FALSE)

# Selected Variable Model
online_pred_max_selected <- knn(train = online_training_selected, test = online_test_selected, cl=online_trainLabels_selected, k=25)
onlinePREDCross_max_selected <- CrossTable(online_testLabels_selected, online_pred_max_selected, prop.chisq = FALSE)

# Roc Curve Plots and AUC Scores

loadPkg("pROC")
# model with every variable
h <- roc(online_testLabels , as.integer(online_pred_max))
auc(h)
plot(h)

# model with specific variables
h_selected <- roc(online_testLabels_selected , as.integer(online_pred_max_selected))
auc(h_selected)
plot(h_selected)

Sadly, neither model meets the AUC cutoff of 0.8. From the confusion matrices we can see that, at the optimal k values, total accuracy is in the high 80s percent. However, the issue we need to address is specificity, measured as true negatives divided by true negatives plus false positives: both models have low specificity.
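As a closing, self-contained check of the knn() workflow used above (scale, split, classify), here is a sketch on R's built-in iris data rather than the shoppers set:

```r
library(class)
set.seed(123)
X  <- scale(iris[, 1:4])                  # standardize before distance-based KNN
tr <- sample(nrow(iris), 100)
pred <- knn(train = X[tr, ], test = X[-tr, ], cl = iris$Species[tr], k = 5)
mean(pred == iris$Species[-tr])           # held-out accuracy
```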